Missing Data and Imputation Methods
This is a Data Science Project
Tharun Teja Gandham & Mourya Rai Papolu (Advisor: Dr. Cohen)
2025-04-14
Introduction
The Problem of Missing Data
- Missing data is a pervasive issue in modern research.
- Common in fields like healthcare, finance, and social sciences.
- Causes include:
- Survey non-responses
- Data entry errors
- Equipment or system failures
- Irrelevant or uncollected data in context (e.g., EHR)
- Inaccurate handling results in:
- Biased statistical results
- Loss of statistical power
- Increased uncertainty
The Impact and Technical Relevance
- In healthcare, incorrect imputation affects diagnostic accuracy and patient outcomes.
- Electronic Health Records (EHRs) are highly prone to missing data.
- Types of Missingness:
- MCAR: Missing Completely At Random
- MAR: Missing At Random (dependent on observed data)
- MNAR: Missing Not At Random (dependent on unobserved data)
- Different missingness mechanisms demand different imputation techniques [@lee2024prevention].
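To make the three mechanisms concrete, the following minimal Python sketch masks a synthetic age column under each one. The records, the masking probabilities, and the function names (`make_records`, `mask_age`) are illustrative assumptions, not part of the project.

```python
import random

def make_records(n, seed=0):
    """Synthetic (sex, age) records; purely illustrative."""
    rng = random.Random(seed)
    return [{"sex": rng.choice(["male", "female"]),
             "age": rng.uniform(1, 80)} for _ in range(n)]

def mask_age(records, mechanism, seed=1):
    """Return the ages with some set to None under the given mechanism."""
    rng = random.Random(seed)
    masked = []
    for r in records:
        if mechanism == "MCAR":
            p = 0.2                                  # same chance for everyone
        elif mechanism == "MAR":
            p = 0.3 if r["sex"] == "male" else 0.1   # depends on observed sex
        elif mechanism == "MNAR":
            p = 0.4 if r["age"] > 60 else 0.1        # depends on the hidden value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if rng.random() < p else r["age"])
    return masked

records = make_records(5000)
for mech in ("MCAR", "MAR", "MNAR"):
    ages = mask_age(records, mech)
    rate = sum(a is None for a in ages) / len(ages)
    print(f"{mech}: {rate:.1%} of ages missing")
```

Under MAR the missingness can be explained by a column that is still observed (sex here), which is what makes model-based imputation feasible; under MNAR the driver is the missing value itself.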
Project Focus and Goal
- This project analyzes multiple imputation techniques on the Titanic dataset.
- Techniques compared:
- Mean/Mode Imputation
- Regression Imputation
- K-Nearest Neighbors (KNN)
- Key Evaluation Metrics:
- Predictive accuracy
- Preservation of data structure
- Bias reduction and interpretability
Literature Review
Missing Data Theory and Context
- Foundational theory identifies three types of missingness: MCAR, MAR, and MNAR [@salgado2016missing].
- Each missingness type influences how data should be handled.
- MAR is most prevalent in clinical settings where missingness relates to observed characteristics (e.g., age, gender).
Traditional Imputation Approaches
- Mean substitution and mode imputation are simple but reduce variability.
- Listwise deletion removes incomplete records, decreasing statistical power.
- These techniques are quick but suitable only when missing data is MCAR or minimal [@alwateer2024missing; @yadav2024computational].
Advanced Imputation Techniques
- Multiple Imputation by Chained Equations (MICE):
- Creates multiple datasets with different estimates, then combines results.
- Popular in medical and epidemiological studies [@pedersen2017missing; @little2019statistical].
- MissForest:
- Non-parametric, uses Random Forests for mixed data types.
- Performs well under MCAR, MAR, and even MNAR [@stekhoven2012missforest].
- KNN Imputation:
- Estimates missing values using the average of similar records.
- Good for nonlinear, local relationships.
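The "combines results" step of MICE described above is commonly done with Rubin's rules: average the per-dataset estimates, then add the within-imputation and between-imputation variances. The sketch below assumes each imputed dataset has already produced one estimate and its squared standard error; the numbers and the function name `pool_rubin` are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m per-dataset estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b        # total variance
    return q_bar, t

# Hypothetical coefficient estimates from m = 5 imputed datasets
est = [0.50, 0.55, 0.45, 0.52, 0.48]
var = [0.010, 0.012, 0.011, 0.009, 0.010]
q, t = pool_rubin(est, var)
print(f"pooled estimate = {q:.3f}, total variance = {t:.4f}")
```

The between-imputation term is what single-imputation methods leave out, which is one reason they tend to understate uncertainty.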
Applications and Challenges
- Choosing the right imputation method is critical for ensuring research validity, especially in structured datasets like clinical or survey data.
- Applications:
- Clinical decision-making based on patient EHRs
- Financial modeling with incomplete transaction logs
- Social science surveys with partial responses
- Challenges:
- High-dimensional data increases computational cost
- Difficult to assess accuracy without knowing true values
- Not all models generalize well across datasets [@afkanpour2024identify; @alwateer2024missing; @sterne2009multiple]
Modern Insights & Hybrid Approaches
- Hybrid techniques mix traditional and machine learning methods.
- Semi-parametric models like hot-deck and predictive mean matching are popular in survey and social science domains [@durrant2005imputation].
- Deep learning and hybrid frameworks outperform traditional models for complex datasets with high missingness [@afkanpour2024identify; @karim2024imputation].
Methods
1. Mean and Mode Imputation
- Mean Imputation: Replaces missing numerical values with the mean of the observed values.
- Mode Imputation: Replaces missing categorical values with the most frequent category.
- Advantages: Simple and fast.
- Limitations: Reduces variance, introduces bias if data is not MCAR.
- Commonly used as a baseline method in many studies [@pedersen2017missing].
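A minimal sketch of both variants, assuming missing entries are represented as `None`. The toy values and the function names `impute_mean` and `impute_mode` are illustrative only.

```python
from statistics import mean, mode

def impute_mean(values):
    """Replace None with the mean of the observed numeric values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

ages = [22.0, None, 38.0, 26.0, None, 35.0]
ports = ["S", "C", None, "S", "Q", "S"]
print(impute_mean(ages))
print(impute_mode(ports))
```

Every gap receives the same fill value, which is the source of the reduced variance noted above.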
2. Regression Imputation
- Uses regression models to predict missing values based on other variables in the dataset.
- Linear regression for numerical, logistic/multinomial for categorical variables.
- Advantages: Maintains variable relationships.
- Limitations: Assumes linearity, may underestimate variability [@farhangfar2007novel].
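The numerical case can be sketched with a single-predictor least-squares fit: train on the rows where the target is observed, then predict the rest. This is a simplified illustration, not the project's model; the toy data, the choice of predictor, and the function names are assumptions.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def impute_regression(x, y):
    """Fill None entries of y with predictions fitted on the complete pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]

# Hypothetical toy data: predictor fully observed, target partly missing
pclass = [1, 1, 2, 3, 3, 2]
age = [40.0, None, 30.0, 20.0, None, 30.0]
print(impute_regression(pclass, age))
```

Because each imputed value sits exactly on the fitted line, the filled-in data show less scatter than real observations would, which is the underestimated variability mentioned above.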
3. K-Nearest Neighbors (KNN) Imputation
- Identifies the ‘k’ most similar records and imputes missing values from their averages (or modes).
- Advantages: No assumption about data distribution; works well with non-linear patterns.
- Limitations: Slower for large datasets; sensitive to scaling [@stekhoven2012missforest].
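A minimal sketch of KNN imputation for one numeric column, assuming the feature columns are fully observed. Features are min-max scaled first because the distance is otherwise dominated by whichever feature has the largest range. The toy rows and the function names are hypothetical.

```python
import math

def min_max_scale(column):
    """Scale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

def knn_impute(features, target, k=3):
    """Fill None in target with the mean target of the k nearest complete rows."""
    cols = [min_max_scale(list(c)) for c in zip(*features)]
    scaled = list(zip(*cols))
    donors = [i for i, t in enumerate(target) if t is not None]
    out = list(target)
    for i, t in enumerate(target):
        if t is None:
            # Rank donor rows by Euclidean distance in the scaled feature space
            nearest = sorted(donors, key=lambda j: math.dist(scaled[i], scaled[j]))[:k]
            out[i] = sum(target[j] for j in nearest) / len(nearest)
    return out

# Hypothetical rows: (Pclass, Fare) fully observed, Age partly missing
features = [(1, 80.0), (1, 75.0), (3, 8.0), (3, 7.5), (3, 9.0), (1, 90.0)]
age = [45.0, None, 22.0, 20.0, None, 50.0]
print(knn_impute(features, age, k=2))
```

Each missing row is compared against every donor row, so the cost grows with the product of missing and complete rows; this is the slowdown on large datasets noted above.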
Data Exploration and Visualization
Overview
This section explores the Titanic dataset in detail, analyzing missing values, variable relationships, and patterns using visualizations and statistical summaries.
Data Set
The Titanic dataset from Kaggle is widely used for data analysis and machine learning projects. It describes the passengers aboard the Titanic, including whether each one survived. Its features include:
- PassengerId: A unique identifier for each passenger.
- Survived: Whether the passenger survived (1) or not (0).
- Pclass: The ticket class (1st, 2nd, or 3rd), which is a proxy for socio-economic status.
- Name: The name of the passenger.
Data Set (Continued)
- Sex: The gender of the passenger (male or female).
- Age: The age of the passenger.
- Fare: The fare paid for the ticket.
- Cabin: The cabin number (many missing values).
- Embarked: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
Missing Values Analysis
| Column | Missing values | Missing (%) |
|---|---|---|
| PassengerId | 0 | 0.0000000 |
| Survived | 0 | 0.0000000 |
| Pclass | 0 | 0.0000000 |
| Name | 0 | 0.0000000 |
| Sex | 0 | 0.0000000 |
| Age | 177 | 19.8653199 |
| SibSp | 0 | 0.0000000 |
| Parch | 0 | 0.0000000 |
| Ticket | 0 | 0.0000000 |
| Fare | 0 | 0.0000000 |
| Cabin | 687 | 77.1043771 |
| Embarked | 2 | 0.2244669 |
The Titanic dataset has missing values primarily in the Age and Cabin columns, with 177 (19.87%) and 687 (77.10%) missing entries, respectively. The Embarked column has only 2 (0.22%) missing values, which is negligible. All other columns, such as PassengerId, Survived, and Sex, are complete with no missing data, making them reliable for analysis.
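A summary like this can be computed by counting empty entries per column and dividing by the row count. The sketch below uses a synthetic stand-in rather than the real file; the 891-row total is an assumption inferred from the reported counts and percentages (177 is 19.87% of 891), and `missing_summary` is a hypothetical helper.

```python
def missing_summary(rows, columns):
    """Return {column: (missing_count, missing_percent)} for a list of dict rows."""
    n = len(rows)
    summary = {}
    for col in columns:
        # Treat both None and empty strings as missing
        missing = sum(1 for r in rows if r.get(col) in (None, ""))
        summary[col] = (missing, 100 * missing / n)
    return summary

# Synthetic stand-in: 891 rows, 177 of them without an Age
rows = [{"Age": None if i < 177 else 30.0, "Sex": "male"} for i in range(891)]
summary = missing_summary(rows, ["Age", "Sex"])
for col, (count, pct) in summary.items():
    print(f"{col}: {count} missing ({pct:.2f}%)")
```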
The bar chart compares missingness across columns: Cabin has the most missing values, followed by Age. This confirms both as the priorities for imputation.
This matrix-style view shows which records have multiple missing fields, which helps identify the pattern of missingness: MCAR, MAR, or MNAR.
Exploratory Visuals
The plot shows that fewer passengers survived than died. This class imbalance matters because it affects both the analysis and the model's survival predictions.
- Among all passengers, more females than males survived.
The stacked bar chart illustrates the survival rate of passengers by class on the Titanic. First-class passengers had the highest proportion of survivors, followed by second class, while third-class passengers had the lowest survival rate. Notably, the majority of third-class passengers did not survive, highlighting the disparity in survival chances based on socioeconomic status. This suggests that higher-class passengers likely had better access to lifeboats or evacuation assistance. The visualization underscores how passenger class played a significant role in determining survival during the disaster.
Age and Correlation Visuals
- We can see that most survivors are in the 20–40 age range, so missing ages for survivors might be imputed with a value in that range.
- Median Age: The median age of survivors (blue box) is slightly higher than that of non-survivors (red box).
- Age Spread: Both groups have a similar spread, with most survivors (blue) and non-survivors (red) between 20 and 40 years. The overlapping IQRs indicate that age alone has limited predictive power unless combined with other features (e.g., class or gender).
The histogram displays the distribution of passenger ages. The majority of passengers fall within the 20–40 age range, with a peak around the early twenties. A smaller number of passengers are younger children or older adults, with very few above 60. The distribution is right-skewed, indicating more younger passengers compared to older ones. This visualization helps understand the overall age demographics and can guide imputation strategies for missing age values.
The correlation plot shows the relationships between key numerical variables in our dataset. Darker red indicates a strong positive correlation, while purple represents a negative correlation. For example, ‘Fare’ and ‘Survived’ show a positive correlation, suggesting passengers who paid higher fares had a higher survival rate. Identifying these relationships helps in understanding data patterns, which is crucial when handling missing values.
Modeling and Results
Overview
This section compares different imputation methods — Mean/Mode, Regression, and KNN — using the Titanic dataset and evaluates their impact on survival prediction using logistic regression. Performance metrics and visual insights are provided.
Age Distribution Comparison
The original dataset shows a natural age spread, with most passengers between 20–40 years. This distribution reflects real variability but contains 177 missing values.
Mean imputation introduces an artificial spike around 28–30 years, replacing all missing ages with the same value. This causes loss of variability and distorts the original pattern.
Regression-based imputation distributes predicted ages more smoothly across the range, preserving natural variance better than mean imputation.
Original Data shows a natural spread of passenger ages (mostly 20-40 years).
Mean Imputation creates an unrealistic spike at one age (28-30), making the data look artificial.
Regression and KNN Imputation preserve the original pattern better - ages remain spread out like real data.
Mean imputation distorts the data by giving every missing age the same value, while regression and KNN keep the natural variation seen in the original dataset. For accurate analysis, regression or KNN works better than simple mean replacement.
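The loss of variability can be checked numerically. In this minimal sketch (hypothetical ages, not the Titanic data), filling every gap with the observed mean leaves the mean unchanged but shrinks the standard deviation by a factor of sqrt((n - 1) / (n + m - 1)), where n is the number of observed values and m the number imputed.

```python
from statistics import mean, stdev

observed = [2.0, 19.0, 24.0, 28.0, 31.0, 36.0, 45.0, 63.0]   # hypothetical ages
n_missing = 4

fill = mean(observed)
imputed = observed + [fill] * n_missing   # every gap gets the same value

print(f"mean: {mean(observed):.2f} -> {mean(imputed):.2f}")
print(f"std:  {stdev(observed):.2f} -> {stdev(imputed):.2f}")
```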
Coefficient Importance
Gender Dominance: “Sexmale” shows the strongest negative impact (-8 to -10 coefficient), confirming males had significantly lower survival rates regardless of imputation method.
Class & Wealth Matter: Pclass (negative) and Fare (positive) reveal poorer passengers (3rd class) fared worse, while wealthier ones (higher fares) survived more—consistent across all methods.
Age’s Role: Age hovers near zero, implying it had limited influence compared to gender/class, though KNN slightly increased its importance (closer to +1).
Port Irrelevance: Embarked features (C/Q/S) show near-zero coefficients, suggesting boarding location had minimal impact on survival, which is a useful insight for model simplification.
This confirms gender and wealth were primary survival drivers, while age/boarding port were secondary—critical for interpreting Titanic survival patterns accurately.
Accuracy Comparison
This bar chart compares the accuracy of logistic regression models trained using three different imputation techniques:
KNN Imputation achieved the highest accuracy of 81.0%, showing its ability to capture complex relationships and retain data integrity.
Regression Imputation follows closely with 80.8%, offering a balance of efficiency and accuracy assuming linearity.
Mean/Mode Imputation, while simple, resulted in the lowest accuracy (80.1%), mainly due to the distortion of real-world data patterns.
The results suggest that advanced imputation methods can improve model performance, although the gain here is modest (0.9 percentage points between KNN and Mean/Mode). They remain the preferred choice in predictive modeling tasks.
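As a rough guide to reading these numbers, the sketch below shows how accuracy might be computed from predicted probabilities and how large its sampling error could be. The labels, probabilities, 0.5 threshold, and evaluation-set size `n_eval` are all assumptions; the source does not state how many cases the accuracies were measured on.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of cases where the thresholded probability matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n independent cases."""
    return math.sqrt(acc * (1 - acc) / n)

# Hypothetical labels and predicted survival probabilities
y_true = [1, 0, 0, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1, 0.3, 0.8]
print(f"accuracy = {accuracy(y_true, y_prob):.3f}")

# Reported accuracies; n_eval is an assumed evaluation-set size
n_eval = 200
for name, acc in [("Mean/Mode", 0.801), ("Regression", 0.808), ("KNN", 0.810)]:
    print(f"{name}: {acc:.3f} +/- {accuracy_std_error(acc, n_eval):.3f}")
```

If the standard error is larger than the gap between methods, the ranking should be treated as tentative until it is confirmed with cross-validation or a larger evaluation set.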
Variable Importance
This horizontal bar chart shows the coefficient estimates from a logistic regression model using Mean/Mode imputed data. Each bar reflects how much a particular feature influences the prediction of survival:
Sex (male) has the most negative impact, confirming that males had lower chances of survival.
Fare has a positive impact—passengers who paid higher fares had better survival chances, often tied to socioeconomic class.
Pclass shows a negative relationship, indicating lower class passengers had lower survival probabilities.
Embarked features have negligible influence, suggesting boarding location did not significantly affect survival.
This insight helps identify which features are most relevant for model prediction and interpretation.
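Logistic regression coefficients are on the log-odds scale, so exponentiating one gives the multiplicative change in the odds of survival per unit increase in that feature. The coefficient values below are hypothetical, chosen only to illustrate sign and scale; they are not the fitted values from the chart.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds of survival per unit increase in the feature."""
    return math.exp(coefficient)

# Hypothetical coefficients for illustration only
coefs = {"Sexmale": -2.5, "Pclass": -1.0, "Fare": 0.01, "EmbarkedQ": 0.0}
for name, beta in coefs.items():
    print(f"{name}: coefficient {beta:+.2f}, odds ratio {odds_ratio(beta):.3f}")
```

An odds ratio below 1 lowers the odds of survival, above 1 raises them, and a value near 1 (coefficient near zero) indicates little influence.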
Conclusion
Objective Achieved: The project compared three imputation techniques (Mean/Mode, Regression, and KNN) for handling missing data in the Titanic dataset.
- Each method was assessed based on its ability to preserve data patterns and improve logistic regression model performance.
- Results showed that:
- KNN Imputation provided the most accurate predictions and best preserved data characteristics.
- Regression Imputation was close in performance but relies on assumptions of linearity.
- Mean/Mode Imputation was the least effective, introducing distortions and artificial uniformity.
- The imputation results are consistent with the Titanic missingness being Missing at Random (MAR), that is, related to other observed variables rather than completely random (MCAR) or dependent on unobserved data (MNAR). Imputation performance alone cannot confirm the mechanism, so this remains a working assumption.
Best Imputation Method:
KNN Imputation achieved the highest accuracy (81.0%) by intelligently filling missing values based on similar passengers.
Regression Imputation was a close second (80.8%), using statistical relationships.
Mean/Mode performed worst (80.1%), as it oversimplified the data.
Top Survival Influencers:
Gender (Sexmale) was the strongest predictor: being male drastically reduced survival chances.
Wealth (Fare and Pclass):
Higher fares improved survival.
Lower-class passengers (3rd class) had significantly worse outcomes.
Age had minor impact, while boarding port (Embarked) mattered very little.
Implications
For Data Quality:
Use KNN/Regression for accurate models—they preserve real patterns. Avoid mean imputation for critical tasks.
For Predictive Modeling:
Focus on gender, class, and fare—they drive survival predictions. Ignore weak factors (e.g., Embarked) to simplify models.
For Fairness:
Mean imputation’s distortion of age/class relationships could hide socioeconomic biases. Advanced methods (KNN) mitigate this.
Limitations
Computational Cost: KNN is slower than regression/mean—may not suit huge datasets.
Assumptions: Regression assumes linear relationships; KNN assumes similar neighbors exist.
Generalizability: Results are specific to the Titanic dataset. Test on other datasets to confirm trends.
Final Recommendation
“For reliable predictions, use KNN imputation and prioritize gender/class/wealth features. Mean imputation is only for quick drafts. Always check how missing data handling affects your model’s fairness and accuracy.”
“Missing data isn’t just an inconvenience—it’s a key decision point in any analysis. Thoughtful imputation strengthens not only your model but the credibility of your findings.”
“Let data guide your decisions — even when it’s incomplete.”